The aim of this project is to predict a given individual’s yearly medical insurance bill in the United States. I obtained my dataset from Kaggle; it contains 6 predictors: age, sex, bmi, number of children, smoker status, and region. I will fit multiple models to this regression problem in order to predict an individual’s yearly medical insurance cost and to determine which factors most strongly affect it.
Medical insurance is a hotly disputed topic in America, as the US charges much more for health care than other countries. Additionally, healthcare and hospital costs are rising, making this a prominent issue: people are in more urgent need of knowing their medical insurance costs, for example to see whether they can afford them or whether they qualify for government aid. The cost of certain healthcare products (such as drugs) depends on external factors, such as market forces. This creates uncertainty and variability in a person’s medical bill, because some people may be more susceptible to requiring these drugs while the cost of the drugs changes with the market. I am hoping to fit a model that best identifies how much a person will have to pay yearly in medical insurance. I believe this is relevant knowledge, as healthcare is especially costly in America, and people should be aware of how much they will have to pay in case they need to apply for government aid to cover their medical bills.
I obtained the dataset from Kaggle. The data contains 7 columns (6 predictor variables and 1 outcome variable) and 1338 rows. The following provides a description of the 7 variables:
1. charges: the outcome variable; the amount an individual has to pay yearly in medical insurance, in dollars.
2. age: the age of the primary beneficiary.
3. sex: the gender of the insurance contractor, either male or female.
4. bmi: the body mass index of the primary beneficiary.
5. children: the number of children covered by the health insurance (also known as the number of dependents).
6. smoker: whether the person smokes: yes or no.
7. region: the beneficiary’s region of residence in the US: northeast, southeast, southwest, or northwest.
## # A tibble: 1,338 × 7
## age sex bmi children smoker region charges
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 28 male 33 3 no southeast 4449.
## 4 33 male 22.7 0 no northwest 21984.
## 5 32 male 28.9 0 no northwest 3867.
## 6 31 female 25.7 0 no southeast 3757.
## 7 46 female 33.4 1 no southeast 8241.
## 8 37 female 27.7 3 no northwest 7282.
## 9 37 male 29.8 2 no northeast 6406.
## 10 60 female 25.8 0 no northwest 28923.
## # … with 1,328 more rows
any(is.na(bill))
## [1] FALSE
The original data does not contain any missing/NA values.
## # A tibble: 1,070 × 7
## age sex bmi children smoker region charges
## <dbl> <fct> <dbl> <dbl> <fct> <fct> <dbl>
## 1 28 male 33 3 no southeast 4449.
## 2 32 male 28.9 0 no northwest 3867.
## 3 25 male 26.2 0 no northeast 2721.
## 4 23 male 34.4 0 no southwest 1827.
## 5 19 male 24.6 1 no southwest 1837.
## 6 23 male 23.8 0 no northeast 2395.
## 7 30 female 32.4 1 no southwest 4150.
## 8 18 male 34.1 0 no southeast 1137.
## 9 23 male 17.4 1 no northwest 2775.
## 10 18 female 26.3 0 no northeast 2198.
## # … with 1,060 more rows
## # A tibble: 268 × 7
## age sex bmi children smoker region charges
## <dbl> <fct> <dbl> <dbl> <fct> <fct> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 31 female 25.7 0 no southeast 3757.
## 4 37 female 27.7 3 no northwest 7282.
## 5 37 male 29.8 2 no northeast 6406.
## 6 27 male 42.1 0 yes southeast 39612.
## 7 56 male 40.3 0 no southwest 10602.
## 8 37 male 28.0 2 no northwest 6204.
## 9 59 female 27.7 3 no southeast 14001.
## 10 22 male 35.6 0 yes southwest 35586.
## # … with 258 more rows
I converted all character variables to type factor so that they are easier to use when modeling and creating visualizations. I split the data into a training and a testing set, stratifying on the outcome variable, charges, so that each set covers a well-proportioned range of charges. The training dataset contains 1,070 observations, while the testing dataset contains 268 observations. Since we only have 6 predictor variables, all of which seem potentially useful, we do not have to worry about eliminating unimportant variables from the dataset. Instead, let’s start exploratory data analysis on the training set.
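A minimal sketch of this preprocessing using dplyr and rsample (the seed is an assumption, and a small synthetic tibble stands in for the real Kaggle data so the sketch runs on its own; 1,070 of 1,338 rows is consistent with an 80% split):

```r
library(dplyr)
library(rsample)  # part of tidymodels

# Synthetic stand-in for the Kaggle data; in the report, `bill` is the
# tibble read from the CSV.
set.seed(123)
n <- 200
bill <- tibble(
  age      = sample(18:64, n, replace = TRUE),
  sex      = sample(c("male", "female"), n, replace = TRUE),
  bmi      = runif(n, 16, 50),
  children = sample(0:5, n, replace = TRUE),
  smoker   = sample(c("yes", "no"), n, replace = TRUE, prob = c(0.2, 0.8)),
  region   = sample(c("northeast", "southeast", "southwest", "northwest"),
                    n, replace = TRUE),
  charges  = rlnorm(n, meanlog = 9, sdlog = 0.8)
)

# Convert character variables to factors for modeling and plotting
bill <- bill %>% mutate(across(where(is.character), as.factor))

# 80/20 split, stratified on the outcome so both sets span the
# full range of charges
bill_split <- initial_split(bill, prop = 0.8, strata = charges)
bill_train <- training(bill_split)
bill_test  <- testing(bill_split)
```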
Let’s start off by creating a correlation matrix of the numeric variables in order to see which ones are most closely related to our outcome variable, charges, and examine those relationships further.
Here we can see that age and BMI have a noticeable positive correlation with charges. I am surprised that the number of children a beneficiary has does not seem to affect insurance cost that greatly. I had assumed that each additional child would mean another insurance bill, but perhaps family plans keep costs from growing unreasonably.
Now let’s look at the distribution of age, which I believe will be an
important predictor variable.
The bar plot indicates that we have a lot of 17-19 year olds in our training set, but the counts for the rest of the ages are relatively even. This is important to note because we do not want an imbalanced distribution: if the training set contained mostly 20 year olds (for example), our model would only learn to predict the insurance bills of 20 year olds and would not be able to predict costs for other ages as accurately. Let’s look at how age compares with insurance costs, as the correlation matrix indicated that age is the predictor with the highest correlation with charges.
Based off this simple line plot we can see that as age increases, insurance charges also increase in a positive linear trend. However, there seem to be three distinct trends, plus around 5 outliers with very high insurance charges. It makes sense that insurance cost rises with age, as people become more prone to illness as they grow older, resulting in higher insurance bills. It is interesting to note the three different trends: for example, a 20 year old’s insurance cost can fall into one of three bands of around $1,000, around $2,000, or around $4,000.
The correlation matrix also indicated that an individual’s BMI plays a big role in their insurance bill, as BMI reflects how healthy their body weight is and, in turn, how prone they are to health risks.
We can see that the median BMI of the individuals in our training set is slightly over 30. A BMI of 30 or above is classified as obese according to the National Health and Nutrition Examination Survey [cite this]. This makes sense for our dataset, as it was collected in America, a country known for high obesity rates. We have a few outliers in the 45-50 BMI range, most likely corresponding to high medical insurance bills. This also aligns with the outliers we saw in the previous plot, where certain individuals had significantly higher insurance bills than the rest of their age group.
Let’s see how insurance charges directly compare with BMI values.
There seems to be no clear trend between BMI and insurance charges.
Let’s see if a trend emerges if we split this scatter plot based on the
sex variable.
The trend remains relatively the same, with males having a greater number of higher insurance charges. This leads me to conclude that, on its own, BMI does not have a significant effect on yearly insurance costs. Just to double check, let’s see if a trend appears between insurance costs and BMI when split by region.
There seems to be no apparent trend here. One particular observation is that the arrangement of points is similar between the two southern regions, and the points in the two northern regions are more similar to each other. For example, the northeast and northwest regions both seem to have a similar number of points in the 30-40 BMI range. This makes sense, as insurance charges are more likely to be similar in regions closer to each other.
I believe that an individual’s location of residence (region) also plays a big role in insurance costs, as some places have a higher cost of healthcare, resulting in a higher bill. Let’s look at the distribution of the regions we have.
We have around the same number of observations per region. Now let’s see how insurance charges differ per region.
The scatter plot displays a similar pattern in each region. To gain more insight into general insurance cost trends across regions, let’s create a new column that reassigns each charge to a bin based on which percentile range it falls into, using the deciles of the observed charges as cut points.
## charges_new
## 10% 2330.629
## 20% 3989.458
## 30% 5555.493
## 40% 7442.786
## 50% 9373.744
## 60% 11444.155
## 70% 13662.588
## 80% 20189.108
## 90% 34316.836
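The decile binning described above can be sketched with quantile() and cut() (the column and label names are assumptions; synthetic charges stand in for the training data):

```r
set.seed(1)
charges <- rlnorm(500, meanlog = 9, sdlog = 0.8)  # stand-in for bill_train$charges

# Decile cut points of the observed charges (0%, 10%, ..., 100%)
breaks <- quantile(charges, probs = seq(0, 1, by = 0.1))

# Bin each charge into a decile labeled 1-10; label 10 means the
# charge falls above the 90th percentile
charges_new <- cut(charges, breaks = breaks, labels = 1:10,
                   include.lowest = TRUE)
table(charges_new)
```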
The above visualization is a histogram depicting how many charges fall within each percentile range. For example, in the northeast, about 29 insurance charges fall between the 70th and 80th percentiles, meaning those 29 individuals pay an insurance cost greater than at least 70% of the population, indicating a fairly high insurance cost. The bin labeled 10 holds the charges above the 90th percentile; these are the highest insurance charges. This allows us to see that the southwest region has relatively lower insurance charges, as few of its charges fall above the 90th percentile. The southeast region, on the other hand, has a significantly higher number of charges above the 90th percentile. However, the southeast also has the highest number of charges below the 10th percentile. Looking at the distribution chart created at the beginning, we see that this is because the southeast simply has more observations, so it makes sense that it has higher counts in certain percentile bins. Overall, the counts are fairly even across regions and percentiles. This indicates that region is most likely not an extremely important predictor of health insurance costs.
Just to double check, let’s look at the standard deviation and average insurance charges across all regions.
## region     average insurance cost   standard deviation
## northeast  13256.55                 10883.97
## southeast  14540.17                 13621.30
## northwest  12496.12                 11205.68
## southwest  12247.05                 11328.72
Here we can see that the southeast has a noticeably higher average insurance cost. This is supported by the histogram above, which showed that the southeast has the largest number of individuals paying charges above the 90th percentile (indicated by the 10 on the x axis), meaning they pay more than 90% of the population. The standard deviation, however, is also fairly large for the southeast region. This is also consistent with the histogram, as its second highest bar was the below-the-10th-percentile bar. It therefore makes sense that the standard deviation is so high, as the charges for that region are fairly spread out. The standard deviation and mean for the rest of the regions are relatively similar. This leads to the conclusion that region seems to have no significant effect on insurance costs.
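A summary table like the one above can be produced with a grouped summarise (a sketch; the column names are assumptions and synthetic data stands in for the training set):

```r
library(dplyr)

set.seed(1)
bill_train <- tibble(
  region  = factor(sample(c("northeast", "southeast", "northwest",
                            "southwest"), 300, replace = TRUE)),
  charges = rlnorm(300, meanlog = 9, sdlog = 0.8)
)

# Mean and standard deviation of charges per region
region_summary <- bill_train %>%
  group_by(region) %>%
  summarise(avg_cost = mean(charges), sd_cost = sd(charges))
region_summary
```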
Now let’s take a look at how the smoker status of an individual
affects his/her medical insurance cost. First, let’s start off by
checking whether we have an imbalanced class or not by looking at the
distribution of the smoker variable.
We have a significantly higher number of non smokers in our training dataset. Since smoker is a predictor variable, and not our target variable, an imbalance here is not detrimental to our analysis or model. Let’s see how the percentiles of charges vary by smoker status, similar to the last plot we looked at but with smoker status in place of region.
Recall, the higher the percentile, the more expensive the medical insurance bill. Here we can see that the majority of the smoking population has an insurance bill that falls above the 90th percentile (indicated by 10 on the x axis). Among the non smoking population, however, the majority fall at or below the 60th percentile. Therefore, it is reasonable to assume that a person’s smoker status affects how much they have to pay yearly in medical insurance.
When looking at the distribution of charges between males and females, both seem to have the same median of a little below $10,000. However, the male population has a significantly higher 75th percentile value: 75% of insurance costs for males fall below $19,000 a year, while 75% of female insurance costs fall below $15,000 a year. This suggests that males tend to pay higher insurance rates per year. Let’s see whether these values differ by region.
The above diagram shows the distribution of charges per region, separated by sex. We can see that the southeast has significantly higher charges for males than any other region. However, females in the northeast have a higher median insurance cost than females in any other region. Another interesting thing to note is that males have a higher average insurance charge in every region except the northwest, where they fall slightly below the females. While these are interesting observations, they do not signify a clear trend or pattern between sex and charges, region and charges, or the interaction among the three. Let’s take a look at the overall averages of insurance charges for males and females and see whether they differ drastically.
## sex     average insurance cost   standard deviation
## male    13879.24                 12920.19
## female  12443.68                 10671.70
As expected, males have a higher average insurance cost, by a little over $1,000. They also have a slightly higher standard deviation. Overall, I believe sex plays a role in determining how much an individual has to pay in health insurance, but it does not have as significant an effect as age does. I am curious to see what a plot of age vs. charges separated by sex looks like.
The overall trends remain the same for each gender, yet males seem to have more points in the $30,000 to $40,000 range. This is expected, as males have a higher average insurance rate. We can also see that both genders have a similar number of outliers (three for females and two for males).
Here we can see the clear difference in charges between people who smoke and people who do not, a difference apparent in every region. One unique thing to note is that the median charge for smokers in the northwest is considerably lower than the median charge for smokers in the other regions.
As the last component of our EDA, let’s take a look at how the number of children a beneficiary covers affects his/her insurance bill. The correlation matrix indicated that charges and children do not have a high correlation. This was surprising to me, as I thought that the more children a beneficiary has to cover, the higher the insurance bill would be, to account for each additional person.
Here we can see that the majority of the individuals in our training set are not covering other individuals. Let’s look at the mean and standard deviation to see just how much the averages differ by number of children.
## number of children   average insurance cost   standard deviation
## zero                 12283.71                 11675.76
## one                  12377.24                 11296.00
## two                  15631.04                 13316.65
## three                14696.83                 11863.01
## four                 14245.01                  9576.93
## five                  8448.06                  3043.14
These values were not what I expected. I expected individuals with more children to have higher yearly insurance costs, as they have more people to cover. It turns out, however, that people with 2 children pay the highest average insurance cost yearly, while people with 5 children pay the least. There does not seem to be a clear trend indicating that average insurance cost rises with the number of children. This matches the correlation matrix from the beginning, where charges and children had a very low correlation. Let’s see if there is any trend between charges, children, and smoker status.
We can see that individuals who smoke have noticeably higher medical insurance charges regardless of the number of children. The ordering of the y axis (from top to bottom) goes from the greatest to the least average charge per number of children. Once again, there is no trend indicating that the more children one has, the more one pays in medical insurance. Based off this graph and the correlation matrix created at the beginning of the EDA, we can conclude that the number of children one covers does not seem to directly affect insurance costs.
Luckily, the original dataset was already clean, so we do not need to worry about cleaning it. I already split the training and testing data prior to EDA, so all we need to do now is create a recipe to fit our data to models.
In order to build a recipe for our model I am going to include all 6
original predictors the dataset came with. We are going to use
step_normalize() to center and scale all our predictors. I
will also be using step_dummy() to convert the independent
categorical variables to dummy variables.
Now we are going to use k-fold cross-validation: it gives a more reliable estimate of model performance than a single validation split, reduces bias relative to one small hold-out set, and the variance of the estimate decreases as k increases, while keeping computation manageable compared to leave-one-out cross-validation. We will stratify on the outcome variable, charges, so that each fold contains a representative range of charges.
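The recipe and resamples described above can be sketched with the recipes and rsample packages (one possible ordering: dummy-encode first, then normalize; synthetic data stands in for the training set so the sketch runs on its own):

```r
library(dplyr)
library(recipes)
library(rsample)

# Synthetic stand-in for the training split
set.seed(1)
bill_train <- tibble(
  age      = sample(18:64, 100, replace = TRUE),
  sex      = factor(sample(c("male", "female"), 100, replace = TRUE)),
  bmi      = runif(100, 16, 50),
  children = sample(0:5, 100, replace = TRUE),
  smoker   = factor(sample(c("yes", "no"), 100, replace = TRUE)),
  region   = factor(sample(c("northeast", "southeast", "southwest",
                             "northwest"), 100, replace = TRUE)),
  charges  = rlnorm(100, meanlog = 9, sdlog = 0.8)
)

bill_recipe <- recipe(charges ~ ., data = bill_train) %>%
  step_dummy(all_nominal_predictors()) %>%   # sex, smoker, region -> dummies
  step_normalize(all_numeric_predictors())   # center and scale predictors

# 10-fold cross-validation, stratified on the outcome
bill_folds <- vfold_cv(bill_train, v = 10, strata = charges)
```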
In order to fit our recipe to models, we are going to build a workflow, specify the engine, and set the mode to regression. Due to the relatively small number of predictors, I figured I would start with simpler models such as linear regression and lasso regression and eventually work my way up to a regression tree and random forest. I chose lasso over ridge regression, as ridge is best used when there is collinearity between predictors; the correlation matrix at the beginning of the EDA indicated low correlation between predictors, so I ruled out ridge. Lasso is also well suited for feature selection, which is useful since I believe some predictors, such as region, will not be particularly helpful. We will assess model performance primarily through the rmse (root mean squared error) and rsq (R squared) metrics, as these are the most commonly used for regression models.
First let’s start off with linear regression.
# Linear Regression
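The model specification and workflow for this step can be sketched with parsnip and workflows (the recipe object name `bill_recipe` is an assumption):

```r
library(dplyr)
library(parsnip)
library(workflows)

# Ordinary least squares via the lm engine
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_wflow <- workflow() %>%
  add_model(lm_spec)
# ... %>% add_recipe(bill_recipe)  # bill_recipe: the recipe built earlier
```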
## # A tibble: 9 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 13168. 182. 72.5 0
## 2 age 3532. 184. 19.2 5.77e- 71
## 3 bmi 2198. 193. 11.4 1.86e- 28
## 4 children 634. 182. 3.48 5.28e- 4
## 5 sex_male -20.2 183. -0.110 9.12e- 1
## 6 region_northwest -122. 224. -0.545 5.86e- 1
## 7 region_southeast -528. 234. -2.25 2.45e- 2
## 8 region_southwest -381. 225. -1.69 9.18e- 2
## 9 smoker_yes 9415. 183. 51.5 8.01e-291
## # A tibble: 6 × 1
## .pred
## <dbl>
## 1 6844.
## 2 5693.
## 3 3257.
## 4 4821.
## 5 806.
## 6 1897.
## # A tibble: 6 × 2
## .pred charges
## <dbl> <dbl>
## 1 6844. 4449.
## 2 5693. 3867.
## 3 3257. 2721.
## 4 4821. 1827.
## 5 806. 1837.
## 6 1897. 2395.
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 5919.
## 2 rsq standard 0.751
Because many of the points do not fall on the dotted line, we can conclude that our linear regression model did not fit well. We also have an extremely high rmse of 5919. Since rmse is measured on the scale of the outcome, an error of nearly $6,000 is large relative to typical charges of a few thousand dollars, indicating that the model cannot predict charges accurately. Let’s see if lasso regression performs better, as it drops any predictors deemed unimportant.
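A lasso specification can be sketched as follows (mixture = 1 gives a pure L1 penalty via the glmnet engine; the penalty itself is left as a tuning parameter):

```r
library(dplyr)
library(parsnip)
library(tune)  # provides tune()

lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("regression")
```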
## # A tibble: 5 × 5
## penalty .metric mean n std_err
## <dbl> <chr> <dbl> <int> <dbl>
## 1 79.1 rmse 5961. 10 67.5
## 2 59.6 rmse 5961. 10 66.8
## 3 45.0 rmse 5962. 10 66.3
## 4 33.9 rmse 5963. 10 65.9
## 5 105. rmse 5963. 10 68.4
The above visualization indicates that as the amount of regularization approaches 15, the rmse reaches a peak and then sharply decreases, while the rsq hits a low and starts rapidly increasing. We want a model with low rmse and high rsq, as that indicates good model fit. The best mean rmse across the folds is 5961. While this is comparable to our previous linear regression model, it is still extremely high, indicating poor model fit.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 5198.
The regression tree model produces an rmse of 5198. This is lower than the lasso regression model we looked at before, indicating a better fit, yet it is still a poor fit to our data. To see if we can find a better decision tree, we are going to tune the cost_complexity parameter.
We can see that as the cost complexity parameter increases, the rmse also increases and the rsq decreases. This is exactly the opposite of what we want, so we can conclude that lower values of the cost complexity parameter produce a better-fitting model. Now we select the model with the best rmse and fit it on the whole training set.
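The tune-then-finalize pattern used here can be sketched end to end on synthetic data (object names are assumptions; in the report the resamples are the stratified folds built earlier):

```r
library(dplyr)
library(parsnip)
library(recipes)
library(rsample)
library(workflows)
library(tune)
library(dials)

set.seed(1)
dat <- tibble(x1 = runif(120), x2 = runif(120),
              charges = 3 * x1 + rnorm(120, sd = 0.1))

tree_spec <- decision_tree(cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_wflow <- workflow() %>%
  add_model(tree_spec) %>%
  add_recipe(recipe(charges ~ ., data = dat))

folds <- vfold_cv(dat, v = 5)
grid  <- grid_regular(cost_complexity(range = c(-4, -1)), levels = 4)

# Cross-validated grid search over cost_complexity
tree_res <- tune_grid(tree_wflow, resamples = folds, grid = grid)

# Select the complexity with the lowest cross-validated rmse and
# refit that model on the full training data
best_tree  <- select_best(tree_res, metric = "rmse")
final_tree <- finalize_workflow(tree_wflow, best_tree) %>% fit(data = dat)
```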
## # A tibble: 1 × 2
## cost_complexity .config
## <dbl> <chr>
## 1 0.00464 Preprocessor1_Model07
## # A tibble: 5 × 5
## cost_complexity .metric mean n std_err
## <dbl> <chr> <dbl> <int> <dbl>
## 1 0.00464 rmse 4645. 10 208.
## 2 0.00167 rmse 4677. 10 184.
## 3 0.000599 rmse 4810. 10 180.
## 4 0.00001 rmse 4848. 10 200.
## 5 0.0000774 rmse 4848. 10 196.
The rmse of our best performing regression tree was 4645. This is lower than the models we have considered so far, so let’s keep it in mind as we fit the bagging model.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 4965.
The bagging model (a random forest in which all predictors are considered at each split) produces an rmse of 4965 on the testing set. The scatter plot also shows a pretty good model fit, as the majority of the points land on the line.
Here we can see that as the number of predictors increases, the rmse
decreases, which is a positive result. Additionally, the smaller the
node size the better the results for the rmse. Let’s see what model
parameters produced the best rmse value.
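The random forest specification being tuned here can be sketched as follows (trees = 500 is an assumption; importance = "impurity" lets us extract variable importance later):

```r
library(dplyr)
library(parsnip)
library(tune)

rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")
```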
## # A tibble: 5 × 6
## mtry min_n .metric mean n std_err
## <int> <int> <chr> <dbl> <int> <dbl>
## 1 6 50 rmse 4568. 10 176.
## 2 6 61 rmse 4575. 10 178.
## 3 6 55 rmse 4576. 10 177.
## 4 6 66 rmse 4593. 10 181.
## 5 5 50 rmse 4598. 10 177.
Here we can see that the model with the lowest average rmse of 4568 uses all 6 predictors (mtry = 6) and min_n = 50. This is a high rmse, but it is among the lowest of the models we have tested so far. Let’s see if a KNN model fits better.
Here we can see that as the number of nearest neighbors increases, the rmse decreases, which is what we want. The lowest rmse we get is 5500, when the number of nearest neighbors is 5. Let’s see how well the boosted trees model works.

# Boosted Trees
## # A tibble: 1 × 4
## mtry min_n learn_rate .config
## <int> <int> <dbl> <chr>
## 1 6 50 0.1 Preprocessor1_Model211
## # A tibble: 5 × 7
## mtry min_n learn_rate .metric mean n std_err
## <int> <int> <dbl> <chr> <dbl> <int> <dbl>
## 1 6 50 0.1 rmse 6078. 10 196.
## 2 6 60 0.1 rmse 6105. 10 202.
## 3 5 50 0.1 rmse 6188. 10 200.
## 4 6 70 0.1 rmse 6252. 10 208.
## 5 5 60 0.1 rmse 6309. 10 202.
Based off the parameters we chose to tune, we can see that when we use all 6 predictors, min_n is 50, and the learning rate is 0.1, we get the model with the lowest rmse of 6078.
Now that we have fit all our proposed models to the training data and found the best rmse per model, we can see that the random forest model produced the lowest rmse value of 4568 on the training set. Let’s fit this model to our testing set and observe model performance.
The final scatter plot displaying the accuracy of the testing set is
relatively good. The higher the charges get, however, the poorer our
model predictions are. More of the points fall on the line when the
charges are below $20,000. Let’s see which predictors this random forest
model deemed the most important.
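Variable importance can be pulled straight from a fitted ranger engine (a toy example on synthetic data where x1 carries the signal; in the report this would be the finalized random forest, and a package such as vip may have produced the plot):

```r
library(dplyr)
library(parsnip)

set.seed(1)
dat <- tibble(x1 = runif(200), x2 = runif(200),
              y = 5 * x1 + rnorm(200, sd = 0.2))

rf_fit <- rand_forest(trees = 200) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression") %>%
  fit(y ~ ., data = dat)

# Impurity-based importance; the informative predictor should dominate
sort(rf_fit$fit$variable.importance, decreasing = TRUE)
```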
Smoker status (particularly a value of yes) had an overwhelmingly important effect on determining insurance charges. Age, BMI, and children also played an important part. The southwest region, if anything, had a slightly negative importance. This aligns with the results of the correlation matrix we produced at the beginning of the EDA.
Overall, the final random forest model did a good job of predicting charges below $20,000 on the testing set. The model still produced a very high RMSE value of over 4,000, though all of the other models produced values greater than this. This high RMSE suggests that our predictors do not fully capture the factors underlying insurance charges. The regression tree, bagging, random forest, and k nearest neighbors models performed the best, while lasso regression, linear regression, and boosted trees performed poorly, with the highest rmse values. I was not surprised that linear regression and lasso regression did not work well, because the EDA revealed no obvious linear pattern between charges and the other predictors.
My next step to improve this model would be to obtain more predictor variables that help predict charges better. Since our model fails to account for important underlying features, I am hoping that a wider variety of predictors will increase the chance of producing a better-equipped model. Another next step would be to create interaction terms between certain predictors. There seemed to be no correlation between predictors in my current set, but with additional predictors I may find some variables that are correlated and create interaction terms between them.